Abstract

Needlets have been recognized as state-of-the-art tools to tackle spherical data, due to their excellent localization properties in both the spatial and frequency domains. This paper develops kernel methods associated with the needlet kernel for nonparametric regression problems whose predictor variables are defined on a sphere. Due to the localization property in the frequency domain, we prove that the regularization parameter of the kernel ridge regression associated with the needlet kernel can decrease arbitrarily fast. A natural consequence is that the regularization term for the kernel ridge regression is not necessary in the sense of rate optimality. Based further on the excellent localization property in the spatial domain, we also prove that all the $l^q$ ($0 < q \le 2$) kernel regularization estimates associated with the needlet kernel, including the kernel lasso estimate and the kernel bridge estimate, possess almost the same generalization capability for a large range of regularization parameters in the sense of rate optimality. This finding tentatively reveals that, if the needlet kernel is utilized, then the choice of $q$ might not have a strong impact on the generalization capability in some modeling contexts. From this perspective, $q$ can be arbitrarily specified, or specified merely by other non-generalization criteria such as smoothness, computational complexity and sparsity.

1. Introduction

Contemporary scientific investigations frequently encounter the problem of exploring the relationship between a response variable and a number of predictor variables whose domain is the surface of a sphere. Examples include the study of gravitational phenomena [12], cosmic microwave background radiation [10], tectonic plate geology [6] and image rendering [?]. As the sphere is topologically a compact two-point homogeneous manifold, some widely used schemes for the Euclidean space, such as neural networks [?] and support vector machines [?], are no longer the most appropriate methods for tackling spherical data. Designing efficient and exclusive approaches to extract useful information from spherical data has been a recent focus in statistical learning [?, 31].

Recent years have witnessed considerable progress on nonparametric regression for spherical data. A classical and long-standing technique is the orthogonal series method associated with spherical harmonics [?], for which the local performance of the estimate is quite poor, since spherical harmonics are not well localized but spread out all over the sphere. Another widely used technique is the stereographic projection method [?], in which statistical problems on the sphere are formulated in the Euclidean space by use of a stereographic projection. A major problem is that the stereographic projection usually leads to a distorted theoretical analysis paradigm and relatively sophisticated statistical behavior. Localization methods, such as the Nadaraya-Watson-like estimate [31], the local polynomial estimate [?] and the local linear estimate [21], are also alternative and interesting nonparametric approaches. Unfortunately, the manifold structure of the sphere is not well taken into account in these approaches. Minh [?] also developed a general theory of reproducing kernel Hilbert spaces on the sphere and advocated utilizing kernel methods to tackle spherical data. However, for some popular kernels such as the Gaussian [27] and polynomial [?] kernels, kernel methods suffer from either a similar problem as the localization methods, or a similar drawback as the orthogonal series methods. In fact, it remains open whether there is an exclusive kernel for spherical data such that both the manifold structure of the sphere and the localization requirement are sufficiently considered.
Our focus in this paper is not on developing a novel technique to cope with spherical nonparametric regression problems, but on introducing an exclusive kernel for kernel methods. In detail, we aim to find a kernel that possesses an excellent spatial localization property and makes full use of the manifold structure of the sphere. Recalling that one of the most important factors embodying the manifold structure is the special frequency domain of the sphere, a kernel which can control the frequency domain freely is preferable. Thus, the kernel we need is actually a function that possesses excellent localization properties in both the spatial and frequency domains. Under this circumstance, the needlet kernel comes into our sights. Needlets, introduced by Narcowich et al. [29, 30], are a new kind of second-generation spherical wavelets, which can be shown to make up a tight frame with both perfect spatial and frequency localization properties. Furthermore, needlets have a clear statistical nature [15], the most important aspect of which is that, in Gaussian and isotropic random fields, the random spherical needlets behave asymptotically as an i.i.d. array [2]. It can be found in [29] that the spherical needlets correspond to a needlet kernel, which is also well localized in the spatial and frequency domains. Consequently, the needlet kernel [29] is proved to possess the reproducing property [29, Lemma 3.8], the compressible property [29, Theorem 3.7] and the best approximation property [?, Corollary 3.10].


The aim of the present article is to pursue the theoretical advantages of the needlet kernel in kernel methods for spherical nonparametric regression problems. If the kernel ridge regression (KRR) associated with the needlet kernel is employed, model selection boils down to determining the frequency and the regularization parameter. Due to the excellent localization in the frequency domain, we find that the regularization parameter of KRR can decrease arbitrarily fast for a suitable frequency. An extreme case is that the regularization term is not necessary for KRR in the sense of rate optimality. This property is totally different from that of kernels without good localization in the frequency domain [8], such as the Gaussian and Abel-Poisson [?] kernels. We regard the above property as the first feature of the needlet kernel. Besides good generalization capability, some real-world applications also require the estimate to possess smoothness, low computational complexity and sparsity [32]. This guides us to consider the $l^q$ ($0 < q \le 2$) kernel regularization schemes (KRS) associated with the needlet kernel, including the kernel bridge regression and the kernel lasso estimate [?]. The first feature of the needlet kernel implies that the generalization capabilities of all $l^q$-KRS with $0 < q \le 2$ are almost the same, provided the regularization parameter is set to be small enough. However, such a setting leaves no difference among the $l^q$-KRS with $0 < q \le 2$, as each of them behaves similarly to least squares. To distinguish the different behaviors of the $l^q$-KRS, we should establish a similar result for a large regularization parameter. With the aid of a probabilistic cubature formula and the excellent localization property of the needlet kernel in both the frequency and spatial domains, we find that all $l^q$-KRS with $0 < q \le 2$ can attain almost the same, almost optimal, generalization error bounds, provided the regularization parameter is not larger than $O(m^{q-1}\varepsilon)$. Here $m$ is the number of samples and $\varepsilon$ is the prediction accuracy. This implies that the choice of $q$ does not have a strong impact on the generalization capability of $l^q$-KRS, with relatively large regularization parameters depending on $q$. From this perspective, $q$ can be specified by other non-generalization criteria such as smoothness, computational complexity and sparsity. We consider this the other feature of the needlet kernel.

The remainder of the paper is organized as follows. In the next section, the needlet kernel is introduced together with its important properties, namely the reproducing property, the compressible property and the best approximation property. In Section 3, we study the generalization capability of the kernel ridge regression associated with the needlet kernel. In Section 4, we consider the generalization capability of the kernel regularization schemes, including the kernel bridge regression and the kernel lasso. In Section 5, we provide the proofs of the main results. We conclude the paper with some useful remarks in the last section.

2. The needlet kernel 

Let $\mathbb{S}^d$ be the unit sphere embedded into $\mathbb{R}^{d+1}$. For integer $k \ge 0$, the restriction to $\mathbb{S}^d$ of a homogeneous harmonic polynomial of degree $k$ is called a spherical harmonic of degree $k$. The class of all spherical harmonics of degree $k$ is denoted by $\mathbb{H}_k^d$, and the class of all spherical harmonics of degree $k \le n$ is denoted by $\Pi_n^d$. Of course, $\Pi_n^d = \bigoplus_{k=0}^n \mathbb{H}_k^d$, and it comprises the restriction to $\mathbb{S}^d$ of all algebraic polynomials in $d + 1$ variables of total degree not exceeding $n$. The dimension of $\mathbb{H}_k^d$ is given by

$$D_k^d := \dim \mathbb{H}_k^d = \begin{cases} \dfrac{2k+d-1}{k+d-1}\dbinom{k+d-1}{k}, & k \ge 1; \\ 1, & k = 0, \end{cases}$$

and that of $\Pi_n^d$ is $\sum_{k=0}^n D_k^d = D_n^{d+1} \sim n^d$.
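The dimension formula can be checked numerically. The following is a minimal sketch; the function names `dim_harmonics` and `dim_polynomials` are illustrative and not taken from the paper. It evaluates $D_k^d$ in exact integer arithmetic and confirms the summation identity $\sum_{k=0}^n D_k^d = D_n^{d+1}$ on small cases.

```python
from math import comb


def dim_harmonics(k: int, d: int) -> int:
    """Dimension D_k^d of the spherical harmonics of degree k on S^d (d >= 1)."""
    if k == 0:
        return 1
    # (2k + d - 1) / (k + d - 1) * C(k + d - 1, k); the division is exact
    return (2 * k + d - 1) * comb(k + d - 1, k) // (k + d - 1)


def dim_polynomials(n: int, d: int) -> int:
    """Dimension of the space of spherical polynomials of degree <= n on S^d."""
    return sum(dim_harmonics(k, d) for k in range(n + 1))


if __name__ == "__main__":
    # On S^2 the familiar values 2k + 1 and (n + 1)^2 are recovered.
    print([dim_harmonics(k, 2) for k in range(5)])
    print(dim_polynomials(5, 2), dim_harmonics(5, 3))
```

For $d = 2$ this reproduces the familiar count $2k+1$ per degree, so the polynomial space of degree at most $n$ has dimension $(n+1)^2 \sim n^2$.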

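The addition formula establishes a connection between spherical harmonics of degree $k$ and the Legendre polynomial $P_k^{d+1}$ [?]:

$$\sum_{l=1}^{D_k^d} Y_{k,l}(x) Y_{k,l}(x') = \frac{D_k^d}{|\mathbb{S}^d|} P_k^{d+1}(x \cdot x'), \qquad (2.1)$$

where $\{Y_{k,l}\}_{l=1}^{D_k^d}$ is an orthonormal basis of $\mathbb{H}_k^d$ and $P_k^{d+1}$ is the Legendre polynomial with degree $k$ and dimension $d + 1$. The Legendre polynomial $P_k^{d+1}$ can be normalized such that $P_k^{d+1}(1) = 1$, and satisfies the orthogonality relations

$$\int_{-1}^{1} P_k^{d+1}(t) P_j^{d+1}(t) (1 - t^2)^{\frac{d-2}{2}} \, dt = \frac{|\mathbb{S}^d|}{|\mathbb{S}^{d-1}| D_k^d} \delta_{k,j},$$

where $\delta_{k,j}$ is the usual Kronecker symbol.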

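The following Funk-Hecke formula establishes a connection between spherical harmonics and a function $\phi \in L^1([-1,1])$ [12]:

$$\int_{\mathbb{S}^d} \phi(x \cdot x') H_k(x') \, d\omega(x') = B(\phi, k) H_k(x), \qquad (2.2)$$

where $H_k \in \mathbb{H}_k^d$ and

$$B(\phi, k) = |\mathbb{S}^{d-1}| \int_{-1}^{1} P_k^{d+1}(t) \phi(t) (1 - t^2)^{\frac{d-2}{2}} \, dt.$$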

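A function $\eta$ is said to be admissible [30] if $\eta \in C^\infty[0, \infty)$ satisfies the following condition:

$$\mathrm{supp}\, \eta \subset [0, 2], \quad \eta(t) = 1 \text{ on } [0, 1], \quad \text{and} \quad 0 \le \eta(t) \le 1 \text{ on } [1, 2].$$

The needlet kernel [29] is then defined to be

$$K_n(x \cdot x') = \sum_{k=0}^{\infty} \eta\left(\frac{k}{n}\right) \frac{D_k^d}{|\mathbb{S}^d|} P_k^{d+1}(x \cdot x'). \qquad (2.3)$$

The needlets can be deduced from the needlet kernel and a spherical cubature formula [16, 23]. We refer the readers to [?, 15, 29] for a detailed description of the needlets.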
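To make definition (2.3) concrete, the sketch below evaluates the needlet kernel on $\mathbb{S}^2$, where $D_k^2 = 2k+1$, $|\mathbb{S}^2| = 4\pi$ and $P_k^3$ is the classical Legendre polynomial. The particular admissible function `eta`, built from the standard bump $e^{-1/t}$, is one possible choice assumed here for illustration; the paper only requires the admissibility condition. The names `eta` and `needlet_kernel` are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def _h(t):
    """exp(-1/t) for t > 0 and 0 otherwise (smooth on the real line)."""
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)  # avoid division by zero where t <= 0
    return np.where(t > 0, np.exp(-1.0 / safe), 0.0)


def eta(t):
    """One admissible cutoff: 1 on [0, 1], 0 on [2, inf), smooth in between."""
    s = np.asarray(t, dtype=float) - 1.0
    return 1.0 - _h(s) / (_h(s) + _h(1.0 - s))


def needlet_kernel(t, n):
    """K_n(t) = sum_k eta(k/n) (2k+1)/(4 pi) P_k(t) on S^2, with t = x . x'."""
    k = np.arange(2 * n + 1)  # eta(k/n) vanishes for k >= 2n
    coeffs = eta(k / n) * (2 * k + 1) / (4.0 * np.pi)
    return legendre.legval(t, coeffs)


if __name__ == "__main__":
    # Kernel values at angular distance 0, pi/2 and pi from the center.
    print(needlet_kernel(np.array([1.0, 0.0, -1.0]), n=8))
```

Evaluating at $t = 1$, $t = 0$ and $t = -1$ shows the kernel concentrating its mass near $x = x'$, which is the spatial localization discussed next.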
According to the definition of the admissible function, it is easy to see that $K_n$ possesses an excellent localization property in the frequency domain. The following Lemma 2.1, which can be found in [29] and [?], shows that $K_n$ also possesses a perfect spatial localization property.

Lemma 2.1. Let $\eta$ be admissible. Then for every $k > 0$ and $r \ge 0$ there exists a constant $C$ depending only on $k$, $r$, $d$ and $\eta$ such that

$$\left| \frac{d^r}{dt^r} K_n(\cos\theta) \right| \le C \frac{n^{d+2r}}{(1 + n\theta)^k}, \quad \theta \in [0, \pi].$$

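For $f \in L^1(\mathbb{S}^d)$, we write

$$K_n * f(x) := \int_{\mathbb{S}^d} K_n(x \cdot x') f(x') \, d\omega(x').$$

We also denote by $E_N(f)_p$ the best approximation error of $f \in L^p(\mathbb{S}^d)$ ($p \ge 1$) from $\Pi_N^d$, i.e.

$$E_N(f)_p := \inf_{P \in \Pi_N^d} \|f - P\|_{L^p(\mathbb{S}^d)}.$$

Then the needlet kernel $K_n$ satisfies the following Lemma 2.2, which can be deduced from [29].

Lemma 2.2. $K_n$ is a reproducing kernel for $\Pi_n^d$, that is, $K_n * P = P$ for $P \in \Pi_n^d$. Moreover, for any $f \in L^p(\mathbb{S}^d)$, $1 \le p \le \infty$, we have $K_n * f \in \Pi_{2n}^d$, and

$$\|K_n * f\|_{L^p(\mathbb{S}^d)} \le C \|f\|_{L^p(\mathbb{S}^d)}, \quad \text{and} \quad \|f - K_n * f\|_{L^p(\mathbb{S}^d)} \le C E_n(f)_p,$$

where $C$ is a constant depending only on $d$, $p$ and $\eta$.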


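It is obvious that $K_n$ is a semi-positive definite kernel, thus it follows from the well-known Mercer theorem [26] that $K_n$ corresponds to a reproducing kernel Hilbert space (RKHS), $\mathcal{H}_K$.

Lemma 2.3. Let $K_n$ be defined above. Then the reproducing kernel Hilbert space associated with $K_n$ is the space $\Pi_{2n}^d$ with the inner product

$$\langle f, g \rangle_{K_n} := \sum_{k=0}^{\infty} \sum_{j=1}^{D_k^d} \eta(k/n)^{-1} \hat{f}_{k,j} \hat{g}_{k,j},$$

where $\hat{f}_{k,j} = \int_{\mathbb{S}^d} f(x) Y_{k,j}(x) \, d\omega(x)$.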
3. Kernel ridge regression associated with the needlet kernel

In spherical nonparametric regression problems with predictor variables $X \in \mathcal{X} = \mathbb{S}^d$ and response variables $Y \in \mathcal{Y} \subseteq \mathbb{R}$, we observe $m$ i.i.d. samples $\mathbf{z}_m = (x_i, y_i)_{i=1}^m$ from an unknown distribution $\rho$ on $Z := \mathcal{X} \times \mathcal{Y}$. Without loss of generality, it is always assumed that $\mathcal{Y} \subseteq [-M, M]$ almost surely, where $M$ is a positive constant. One natural measurement of the estimate $f$ is the generalization error,

$$\mathcal{E}(f) := \int_Z (f(X) - Y)^2 \, d\rho,$$

which is minimized by the regression function defined by

$$f_\rho(x) := \int_{\mathcal{Y}} Y \, d\rho(Y|x).$$

Let $L^2_{\rho_X}$ be the Hilbert space of $\rho_X$ square integrable functions, with norm $\|\cdot\|_\rho$, where $\rho_X$ denotes the marginal distribution of $\rho$ on $\mathcal{X}$. In the setting of $f_\rho \in L^2_{\rho_X}$, it is well known that, for every $f \in L^2_{\rho_X}$, there holds

$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2. \qquad (3.1)$$

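We formulate the learning problem in terms of probability rather than expectation. To this end, we present a formal way to measure the performance of learning schemes in probability. Let $\Theta \subset L^2_{\rho_X}$ and $\mathcal{M}(\Theta)$ be the class of all Borel measures $\rho$ such that $f_\rho \in \Theta$. For each $\varepsilon > 0$, we enter into a competition over all estimators based on $m$ samples, $\Phi_m : \mathbf{z} \mapsto f_{\mathbf{z}}$, by

$$\mathbf{AC}_m(\Theta, \varepsilon) := \inf_{f_{\mathbf{z}} \in \Phi_m} \sup_{\rho \in \mathcal{M}(\Theta)} P^m\{\mathbf{z} : \|f_\rho - f_{\mathbf{z}}\|_\rho^2 > \varepsilon\}.$$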

As it is impossible to obtain a nontrivial convergence rate without imposing any restriction on the distribution $\rho$ [?, Chap. 3], we should introduce certain prior information. Let $r > 0$. Denote the Bessel-potential Sobolev class $W_r$ [25] to be all $f$ such that

$$\|f\|_{W_r} := \left\| \sum_{k=0}^{\infty} \left( k + \frac{d-1}{2} \right)^r P_k f \right\|_2 \le 1,$$

where

$$P_k f = \sum_{j=1}^{D_k^d} \langle f, Y_{k,j} \rangle Y_{k,j}.$$

It follows from the well-known Sobolev embedding theorem that $W_r \subset C(\mathbb{S}^d)$, provided $r > d/2$. In our analysis, we assume $f_\rho \in W_r$.

The learning scheme employed in this section is the following kernel ridge regression (KRR) associated with the needlet kernel:

$$f_{\mathbf{z},\lambda} := \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_{K_n}^2 \right\}. \qquad (3.2)$$

Since $y \in [-M, M]$, it is easy to see that $\mathcal{E}(\pi_M f) \le \mathcal{E}(f)$ for arbitrary $f \in L^2_{\rho_X}$, where $\pi_M u := \min\{M, |u|\}\, \mathrm{sgn}(u)$ is the truncation operator. As employing the truncation operator requires no additional computation, it has been used in a large number of papers; to name just a few, [?, 18, 26, 37, 38]. The following Theorem 3.1 illustrates the generalization capability of KRR associated with the needlet kernel and reveals the first feature of the needlet kernel.

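Theorem 3.1. Let $f_\rho \in W_r$ with $r > d/2$, $m \in \mathbb{N}$, $\varepsilon > 0$ be any real number, and $n \sim \varepsilon^{-1/(2r)}$. If $f_{\mathbf{z},\lambda}$ is defined as in (3.2) with $0 \le \lambda \le M^{-2}\varepsilon$, then there exist positive constants $C_i$, $i = 1, \dots, 4$, depending only on $M$, $\rho$ and $d$, $\varepsilon_0 > 0$ and $\varepsilon_-, \varepsilon_+$ satisfying

$$C_1 m^{-2r/(2r+d)} \le \varepsilon_- \le \varepsilon_+ \le C_2 (m/\log m)^{-2r/(2r+d)}, \qquad (3.3)$$

such that for any $\varepsilon < \varepsilon_-$,

$$\sup_{f_\rho \in W_r} P^m\{\mathbf{z} : \|f_\rho - \pi_M f_{\mathbf{z},\lambda}\|_\rho^2 > \varepsilon\} \ge \mathbf{AC}_m(W_r, \varepsilon) \ge \varepsilon_0, \qquad (3.4)$$

and for any $\varepsilon \ge \varepsilon_+$,

$$e^{-C_3 m \varepsilon} \le \mathbf{AC}_m(W_r, \varepsilon) \le \sup_{f_\rho \in W_r} P^m\{\mathbf{z} : \|f_\rho - \pi_M f_{\mathbf{z},\lambda}\|_\rho^2 > \varepsilon\} \le e^{-C_4 m \varepsilon}. \qquad (3.5)$$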

We give several remarks on Theorem 3.1 below. In some real-world applications, only $m$ data are available, the purpose of learning is to produce an estimate with prediction error at most $\varepsilon$, and statisticians are required to assess the probability of success. It is obvious that this probability depends heavily on $m$ and $\varepsilon$. If $m$ is too small, then no estimate can finish the learning task with small $\varepsilon$. This fact is quantitatively verified by inequality (3.4). More specifically, (3.4) shows that if the learning task is to yield an accuracy at most $\varepsilon \le \varepsilon_-$ and, other than the prior knowledge $f_\rho \in W_r$, only $m \le \varepsilon_-^{-(2r+d)/(2r)}$ data are available, then all learning schemes, including KRR associated with the needlet kernel, may fail with high probability. The only way to circumvent this is to acquire more samples, as inequalities (3.5) show. (3.5) says that if the number of samples reaches $\varepsilon_+^{-(2r+d)/(2r)}$, then the probability of success of KRR is at least $1 - e^{-C_4 m \varepsilon}$. The first inequality (lower bound) of (3.5) implies that this confidence cannot be improved further. The values of $\varepsilon_-$ and $\varepsilon_+$ are thus critical, since the smallest number of samples needed to finish the learning task is determined by the interval $[\varepsilon_-, \varepsilon_+]$. Inequalities (3.3) show that, for KRR, there holds

$$[\varepsilon_-, \varepsilon_+] \subset \left[ C_1 m^{-2r/(2r+d)}, \; C_2 (m/\log m)^{-2r/(2r+d)} \right].$$

This implies that the interval $[\varepsilon_-, \varepsilon_+]$ is almost the shortest possible, in the sense that, up to a logarithmic factor, the upper and lower bounds of the interval are asymptotically identical. Furthermore, Theorem 3.1 also presents a sharp phase transition phenomenon of KRR. The behavior of the confidence function changes dramatically within the critical interval $[\varepsilon_-, \varepsilon_+]$: it drops from a constant $\varepsilon_0$ to an exponentially small quantity. All the above assertions show that the learning performance of KRR is essentially revealed in Theorem 3.1.
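The scalings in Theorem 3.1 can be tabulated with a few lines of code. The sketch below is only illustrative: it sets the unspecified constants $C_1$, $C_2$ and the constant hidden in $n \sim \varepsilon^{-1/(2r)}$ to 1, which is an assumption made here and not a value given in the paper, and the function names are hypothetical.

```python
import math


def needlet_frequency(eps: float, r: float) -> int:
    """Frequency n ~ eps^(-1/(2r)); the proportionality constant is assumed to be 1."""
    return max(1, math.ceil(eps ** (-1.0 / (2.0 * r))))


def critical_interval(m: int, r: float, d: int) -> tuple:
    """Endpoints of (3.3) with C_1 = C_2 = 1 (assumed): m^(-a) and (m / log m)^(-a)."""
    a = 2.0 * r / (2.0 * r + d)
    return m ** (-a), (m / math.log(m)) ** (-a)


if __name__ == "__main__":
    for m in (10**3, 10**4, 10**5):
        lo, hi = critical_interval(m, r=2.0, d=2)
        print(m, lo, hi, needlet_frequency(hi, r=2.0))
```

The two endpoints differ only by the factor $(\log m)^{2r/(2r+d)}$, which is the logarithmic gap mentioned above.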

An interesting finding in Theorem 3.1 is that the regularization parameter of KRR can decrease arbitrarily fast, provided it is smaller than $M^{-2}\varepsilon$. The extreme case is that least squares possesses the same generalization performance as KRR. This is not surprising in the realm of nonparametric regression, due to the needlet kernel's localization property in the frequency domain. By controlling the frequency of the needlet kernel, $\mathcal{H}_K$ is essentially a linear space of finite dimension. Thus, [14, Th. 3.2 & Th. 11.3] together with Lemma 5.1 in the present paper automatically yields the optimal learning rate of least squares associated with the needlet kernel in the sense of expectation. In contrast, Theorem 3.1 presents an exponential confidence estimate for KRR, which together with (3.3) makes [14, Th. 11.3] a corollary of Theorem 3.1. Theorem 3.1 also shows that the purpose of introducing the regularization term in KRR is only to overcome the singularity of the kernel matrix $A := (K_n(x_i \cdot x_j))_{i,j=1}^m$, which arises since $m$ exceeds the dimension of $\mathcal{H}_K$ in our setting. Under this circumstance, a small $\lambda$ leads to ill-conditioning of the matrix $A + m\lambda I$, and a large $\lambda$ produces a large approximation error. Theorem 3.1 illustrates that if the needlet kernel is employed, then we can set $\lambda = M^{-2}\varepsilon$ to guarantee both a small condition number of the kernel matrix and an almost optimal generalization error bound. From (3.3), it is easy to deduce that, to attain the optimal learning rate $m^{-2r/(2r+d)}$, the minimal eigenvalue of the matrix $A + m\lambda I$ is of order $m^{d/(2r+d)}$, which guarantees that the matrix inverse technique is suitable for solving (3.2).
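The matrix inverse technique mentioned above can be sketched as follows for $\mathbb{S}^2$. By the standard representer theorem, a minimizer of (3.2) has the form $\sum_i a_i K_n(x_i \cdot x)$ with $(A + m\lambda I)a = y$. Everything else in the sketch is an assumption made for illustration: the admissible function `eta`, the helper names, the uniform sampling of the sphere, and the toy target $f(x) = x_3$.

```python
import numpy as np
from numpy.polynomial import legendre


def _h(t):
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)
    return np.where(t > 0, np.exp(-1.0 / safe), 0.0)


def eta(t):
    """An assumed admissible cutoff: 1 on [0, 1], 0 beyond 2."""
    s = np.asarray(t, dtype=float) - 1.0
    return 1.0 - _h(s) / (_h(s) + _h(1.0 - s))


def needlet_kernel(t, n):
    """Needlet kernel on S^2 as a function of the inner product t."""
    k = np.arange(2 * n + 1)
    return legendre.legval(t, eta(k / n) * (2 * k + 1) / (4.0 * np.pi))


def sample_sphere(m, rng):
    """m points drawn uniformly from S^2 (normalized Gaussian vectors)."""
    x = rng.normal(size=(m, 3))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def krr_fit(X, y, n, lam):
    """Solve (A + m*lam*I) a = y for the kernel matrix A = (K_n(x_i . x_j))."""
    m = len(y)
    A = needlet_kernel(np.clip(X @ X.T, -1.0, 1.0), n)
    return np.linalg.solve(A + m * lam * np.eye(m), y)


def krr_predict(X_train, a, X_new, n, M):
    """Evaluate the estimate and apply the truncation operator pi_M."""
    f = needlet_kernel(np.clip(X_new @ X_train.T, -1.0, 1.0), n) @ a
    return np.clip(f, -M, M)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = sample_sphere(200, rng)
    y = X[:, 2]  # toy target: a degree-1 spherical harmonic, bounded by M = 1
    a = krr_fit(X, y, n=4, lam=1e-3)
    print(np.mean((krr_predict(X, a, X, n=4, M=1.0) - y) ** 2))
```

With $n = 4$ the hypothesis space has dimension below the sample size $m = 200$, so $A$ itself is singular and the shift $m\lambda I$ is what makes the linear system solvable, in line with the remark above.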


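4. Kernel regularization schemes associated with the needlet kernel

In the last section, we analyzed the generalization capability of KRR associated with the needlet kernel. This section aims to study the learning capability of the kernel regularization scheme (KRS) whose hypothesis space is the sample dependent hypothesis space associated with $K_n(\cdot, \cdot)$,

$$\mathcal{H}_{K,\mathbf{z}} := \left\{ \sum_{i=1}^m a_i K_n(x_i, \cdot) : a_i \in \mathbb{R} \right\}.$$

The corresponding $l^q$-KRS is defined by

$$f_{\mathbf{z},\lambda,q} \in \arg\min_{f \in \mathcal{H}_{K,\mathbf{z}}} \left\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \Omega_{\mathbf{z}}^q(f) \right\}, \qquad (4.1)$$

where

$$\Omega_{\mathbf{z}}^q(f) := \inf_{(a_i)_{i=1}^m \in \mathbb{R}^m} \sum_{i=1}^m |a_i|^q \quad \text{for } f = \sum_{i=1}^m a_i K_n(x_i, \cdot).$$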

With different choices of the order $q$, (4.1) leads to various specific forms of the $l^q$ regularizer. $f_{\mathbf{z},\lambda,2}$ corresponds to the kernel ridge regression [32], which smoothly shrinks the coefficients toward zero, and $f_{\mathbf{z},\lambda,1}$ leads to the LASSO [?], which sets small coefficients exactly to zero and thereby also serves as a variable selection operator. The varying forms and properties of $f_{\mathbf{z},\lambda,q}$ make the choice of the order $q$ crucial in applications. Apparently, an optimal $q$ may depend on many factors, such as the learning algorithm and the purpose of the study. The following Theorem 4.1 shows that if the needlet kernel is utilized in $l^q$-KRS, then $q$ may not have an important impact on the generalization capability for a large range of regularization parameters in the sense of rate optimality.
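For $q = 1$, scheme (4.1) can be approached numerically by minimizing $\frac{1}{m}\|Aa - y\|_2^2 + \lambda\|a\|_1$ over the coefficient vector $a$, where $A$ is the kernel matrix. The sketch below uses proximal gradient descent (ISTA), a generic solver chosen here for illustration and not an algorithm taken from the paper. It also works directly with one coefficient vector rather than the infimum over all representations in the definition of $\Omega_{\mathbf{z}}^1$, which is a simplifying assumption. The function names are hypothetical.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1: shrink every entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def kernel_lasso(A, y, lam, iters=500):
    """ISTA for (1/m) * ||A a - y||^2 + lam * ||a||_1; returns a and the objective history."""
    m = len(y)
    L = 2.0 * np.linalg.norm(A, 2) ** 2 / m  # Lipschitz constant of the smooth part
    a = np.zeros(m)
    history = []
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ a - y) / m
        a = soft_threshold(a - grad / L, lam / L)
        history.append(np.mean((A @ a - y) ** 2) + lam * np.sum(np.abs(a)))
    return a, history


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    B = rng.normal(size=(30, 30))
    A = B @ B.T / 30.0  # stand-in positive semi-definite kernel matrix
    y = rng.normal(size=30)
    a, history = kernel_lasso(A, y, lam=0.1)
    print(history[-1], int(np.sum(a != 0.0)))
```

In practice `A` would be the needlet kernel matrix $(K_n(x_i \cdot x_j))_{i,j=1}^m$; the random matrix in the usage example is only a placeholder. The soft-thresholding step is what sets small coefficients exactly to zero.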

Before stating the main results, we should first introduce a restriction on the marginal distribution $\rho_X$. Let $J$ be the identity mapping

$$L^2(\mathbb{S}^d) \xrightarrow{\;J\;} L^2_{\rho_X},$$

and $D_{\rho_X} = \|J\|$. $D_{\rho_X}$ is called the distortion of $\rho_X$ (with respect to the Lebesgue measure), which measures how much $\rho_X$ distorts the Lebesgue measure.


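Theorem 4.1. Let $f_\rho \in W_r$ with $r > d/2$, $D_{\rho_X} < \infty$, $m \in \mathbb{N}$, $\varepsilon > 0$ be any real number, and $n \sim \varepsilon^{-1/(2r)}$. If $f_{\mathbf{z},\lambda,q}$ is defined as in (4.1) with $\lambda \le m^{q-1}\varepsilon$ and $0 < q \le 2$, then there exist positive constants $C_i$, $i = 1, \dots, 4$, depending only on $M$, $\rho$, $q$ and $d$, $\varepsilon_0 > 0$ and $\varepsilon_-, \varepsilon_+$ satisfying

$$C_1 m^{-2r/(2r+d)} \le \varepsilon_- \le \varepsilon_+ \le C_2 (m/\log m)^{-2r/(2r+d)}, \qquad (4.2)$$

such that for any $\varepsilon < \varepsilon_-$,

$$\sup_{f_\rho \in W_r} P^m\{\mathbf{z} : \|f_\rho - \pi_M f_{\mathbf{z},\lambda,q}\|_\rho^2 > \varepsilon\} \ge \mathbf{AC}_m(W_r, \varepsilon) \ge \varepsilon_0, \qquad (4.3)$$

and for any $\varepsilon \ge \varepsilon_+$,

$$e^{-C_3 m \varepsilon} \le \mathbf{AC}_m(W_r, \varepsilon) \le \sup_{f_\rho \in W_r} P^m\{\mathbf{z} : \|f_\rho - \pi_M f_{\mathbf{z},\lambda,q}\|_\rho^2 > \varepsilon\} \le e^{-C_4 D_{\rho_X} m \varepsilon}. \qquad (4.4)$$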


Compared with KRR (3.2), a common consensus is that $l^q$-KRS (4.1) may bring certain additional benefits, such as sparsity, for a suitable choice of $q$. However, it should be noticed that this assertion may not always be true, since it depends heavily on the value of the regularization parameter. If the regularization parameter is extremely small, then $l^q$-KRS for any $q \in (0, 2]$ behaves similarly to least squares. Under this circumstance, Theorem 4.1 obviously holds due to the conclusion of Theorem 3.1. To distinguish the character of $l^q$-KRS with different $q$, one should consider a relatively large regularization parameter. Theorem 4.1 shows that, for a large range of regularization parameters, all the $l^q$-KRS associated with the needlet kernel can attain the same, almost optimal, generalization error bound. It should be highlighted that the quantity $m^{q-1}\varepsilon$ is, to the best of our knowledge, almost the largest value of the regularization parameter among all the existing results. We encourage the readers to compare our result with the results in [18, 33, 37]. Furthermore, we find that $m^{q-1}\varepsilon$ is sufficient to embody the feature of kernel regularization schemes. Taking the kernel lasso for example, the regularization parameter derived in Theorem 4.1 asymptotically equals $\varepsilon$. It is easy to see that, to yield a prediction accuracy $\varepsilon$, we have

$$f_{\mathbf{z},\lambda,1} \in \arg\min_{f \in \mathcal{H}_{K,\mathbf{z}}} \left\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \Omega_{\mathbf{z}}^1(f) \right\}$$

and

$$\frac{1}{m} \sum_{i=1}^m (f_{\mathbf{z},\lambda,1}(x_i) - y_i)^2 \le \varepsilon.$$

According to the structural risk minimization principle and $\lambda = \varepsilon$, we obtain

$$\Omega_{\mathbf{z}}^1(f_{\mathbf{z},\lambda,1}) \le C.$$

Intuitively, the generalization capability of $l^q$-KRS (4.1) with a large regularization parameter may depend on the choice of $q$. However, it follows from Theorem 4.1 that the learning schemes defined by (4.1) can indeed achieve the same asymptotically optimal rates for all $q \in (0, \infty)$. In other words, on the premise of embodying the features of $l^q$-KRS with different $q$, the choice of $q$ has no influence on the generalization capability in the sense of rate optimality. Thus, we can determine $q$ by taking other non-generalization considerations, such as smoothness, sparsity and computational complexity, into account. Finally, we explain the reason for this phenomenon by taking the needlet kernel's perfect localization property in the spatial domain into account. To approximate $f_\rho(x)$, due to the localization property of $K_n$, we can construct an approximant in $\mathcal{H}_{K,\mathbf{z}}$ with a few $K_n(x_i, x)$'s whose centers $x_i$ are near to $x$. As $f_\rho$ is bounded by $M$, the coefficients of these terms are also bounded. That is, we can construct, in $\mathcal{H}_{K,\mathbf{z}}$, a good approximant whose $l^q$ norm is bounded for arbitrary $0 < q < \infty$. Then, using the standard error decomposition technique in [?], which divides the generalization error into the approximation error and the sample error, the approximation error of $l^q$-KRS is independent of $q$. For the sample error, we can tune $\lambda$, which may depend on $q$, to offset the effect of $q$. Then, a generalization error estimate independent of $q$ is natural.

5. Proofs 

In this section, we present the proofs of Theorem 3.1 and Theorem 4.1, respectively.

5.1. Proof of Theorem 3.1

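For the sake of brevity, we set $f_n = K_n * f_\rho$. Let

$$\mathcal{S}(\lambda, m, n) := \left\{ \mathcal{E}(\pi_M f_{\mathbf{z},\lambda}) - \mathcal{E}_{\mathbf{z}}(\pi_M f_{\mathbf{z},\lambda}) + \mathcal{E}_{\mathbf{z}}(f_n) - \mathcal{E}(f_n) \right\},$$

where $\mathcal{E}_{\mathbf{z}}(f) := \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2$ denotes the empirical error. Then it is easy to deduce that

$$\mathcal{E}(\pi_M f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \le \mathcal{S}(\lambda, m, n) + \mathcal{D}_n(\lambda), \qquad (5.1)$$

where $\mathcal{D}_n(\lambda) := \|f_n - f_\rho\|_\rho^2 + \lambda \|f_n\|_{K_n}^2$. If we set $\xi_1 := (\pi_M(f_{\mathbf{z},\lambda})(x) - y)^2 - (f_\rho(x) - y)^2$ and $\xi_2 := (f_n(x) - y)^2 - (f_\rho(x) - y)^2$, then

$$E(\xi_1) = \int_Z \xi_1(x, y) \, d\rho = \mathcal{E}(\pi_M(f_{\mathbf{z},\lambda})) - \mathcal{E}(f_\rho), \quad \text{and} \quad E(\xi_2) = \mathcal{E}(f_n) - \mathcal{E}(f_\rho).$$

Therefore, we can rewrite the sample error as

$$\mathcal{S}(\lambda, m, n) = \left\{ E(\xi_1) - \frac{1}{m} \sum_{i=1}^m \xi_1(z_i) \right\} + \left\{ \frac{1}{m} \sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \right\} =: \mathcal{S}_1 + \mathcal{S}_2. \qquad (5.2)$$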

The aim of this subsection is to bound $\mathcal{D}_n(\lambda)$, $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively. To bound $\mathcal{D}_n(\lambda)$, we need the following two lemmas. The first one is the Jackson-type inequality that can be deduced from [25, 29], and the second one describes the RKHS norm of $f_n$.

Lemma 5.1. Let $f \in W_r$. Then there exists a constant $C$ depending only on $d$ and $r$ such that

$$\|f - f_n\| \le C n^{-r},$$

where $\|\cdot\|$ denotes the uniform norm on the sphere.

Lemma 5.2. Let $f_n$ be defined as above. Then we have

$$\|f_n\|_{K_n}^2 \le M^2.$$
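Proof. Due to the addition formula (2.1), we have

$$K_n(x \cdot y) = \sum_{k=0}^{\infty} \eta\left(\frac{k}{n}\right) \frac{D_k^d}{|\mathbb{S}^d|} P_k^{d+1}(x \cdot y) = \sum_{k=0}^{\infty} \eta\left(\frac{k}{n}\right) \sum_{j=1}^{D_k^d} Y_{k,j}(x) Y_{k,j}(y).$$

Since

$$K_n * f(x) = \int_{\mathbb{S}^d} K_n(x \cdot y) f(y) \, d\omega(y),$$

it follows from the Funk-Hecke formula (2.2) that

$$\begin{aligned}
\widehat{K_n * f}_{u,v} &= \int_{\mathbb{S}^d} K_n * f(x) Y_{u,v}(x) \, d\omega(x) = \int_{\mathbb{S}^d} \int_{\mathbb{S}^d} K_n(x \cdot x') f(x') \, d\omega(x') \, Y_{u,v}(x) \, d\omega(x) \\
&= \int_{\mathbb{S}^d} f(x') \int_{\mathbb{S}^d} K_n(x \cdot x') Y_{u,v}(x) \, d\omega(x) \, d\omega(x') \\
&= \int_{\mathbb{S}^d} |\mathbb{S}^{d-1}| \int_{-1}^{1} K_n(t) P_u^{d+1}(t) (1 - t^2)^{\frac{d-2}{2}} \, dt \; Y_{u,v}(x') f(x') \, d\omega(x') \\
&= |\mathbb{S}^{d-1}| \, \hat{f}_{u,v} \int_{-1}^{1} K_n(t) P_u^{d+1}(t) (1 - t^2)^{\frac{d-2}{2}} \, dt.
\end{aligned}$$

Moreover,

$$\begin{aligned}
\int_{-1}^{1} K_n(t) P_u^{d+1}(t) (1 - t^2)^{\frac{d-2}{2}} \, dt &= \sum_{k=0}^{\infty} \eta\left(\frac{k}{n}\right) \frac{D_k^d}{|\mathbb{S}^d|} \int_{-1}^{1} P_k^{d+1}(t) P_u^{d+1}(t) (1 - t^2)^{\frac{d-2}{2}} \, dt \\
&= \eta\left(\frac{u}{n}\right) \frac{D_u^d}{|\mathbb{S}^d|} \cdot \frac{|\mathbb{S}^d|}{|\mathbb{S}^{d-1}| D_u^d} = \eta\left(\frac{u}{n}\right) \frac{1}{|\mathbb{S}^{d-1}|}.
\end{aligned}$$

Therefore,

$$\widehat{K_n * f}_{u,v} = \eta\left(\frac{u}{n}\right) \hat{f}_{u,v}.$$

This implies

$$\|K_n * f\|_{K_n}^2 = \sum_{u=0}^{\infty} \eta\left(\frac{u}{n}\right)^{-1} \sum_{v=1}^{D_u^d} \left( \widehat{K_n * f}_{u,v} \right)^2 \le \sum_{u=0}^{\infty} \sum_{v=1}^{D_u^d} \hat{f}_{u,v}^2 \le \|f\|_{L^2(\mathbb{S}^d)}^2 \le M^2.$$

The proof of Lemma 5.2 is completed. ■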
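The key step of this proof, that convolution with $K_n$ multiplies the degree-$u$ Fourier coefficients by $\eta(u/n)$, can be verified numerically on $\mathbb{S}^2$. There the Funk-Hecke constant reduces to $2\pi \int_{-1}^{1} K_n(t) P_u(t) \, dt$, which should equal $\eta(u/n)$. The sketch below assumes one particular admissible `eta` (the paper does not fix one) and uses Gauss-Legendre quadrature, which is exact for the polynomial integrand; the helper names are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def _h(t):
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)
    return np.where(t > 0, np.exp(-1.0 / safe), 0.0)


def eta(t):
    """An assumed admissible cutoff: 1 on [0, 1], 0 beyond 2."""
    s = np.asarray(t, dtype=float) - 1.0
    return 1.0 - _h(s) / (_h(s) + _h(1.0 - s))


def needlet_kernel(t, n):
    """Needlet kernel on S^2 as a function of the inner product t."""
    k = np.arange(2 * n + 1)
    return legendre.legval(t, eta(k / n) * (2 * k + 1) / (4.0 * np.pi))


def multiplier(u, n):
    """2*pi * integral over [-1, 1] of K_n(t) P_u(t) dt via Gauss-Legendre quadrature."""
    nodes, weights = legendre.leggauss(2 * n + u + 2)  # exact for this integrand
    p_u = legendre.legval(nodes, np.eye(u + 1)[u])     # Legendre polynomial P_u
    return 2.0 * np.pi * np.sum(weights * needlet_kernel(nodes, n) * p_u)


if __name__ == "__main__":
    n = 8
    for u in (0, 3, 8, 12, 20):
        print(u, multiplier(u, n), float(eta(u / n)))
```

Degrees $u \le n$ pass through unchanged (multiplier 1), which is the reproducing property of Lemma 2.2, while degrees $u \ge 2n$ are removed entirely.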

Based on the above two lemmas, it is easy to deduce an upper bound of $\mathcal{D}_n(\lambda)$.

Proposition 5.3. Let $f \in W_r$. There exists a positive constant $C$ depending only on $r$ and $d$ such that

$$\mathcal{D}_n(\lambda) \le C n^{-2r} + M^2 \lambda.$$

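In the rest of this subsection, we will bound $\mathcal{S}_1$ and $\mathcal{S}_2$ respectively. The approach used here is somewhat standard in learning theory. $\mathcal{S}_2$ is a typical quantity that can be estimated by probability inequalities. We shall bound it by the following one-sided Bernstein inequality.

Lemma 5.4. Let $\xi$ be a random variable on a probability space $Z$ with mean $E(\xi)$ and variance $\sigma^2(\xi) = \sigma^2$. If $|\xi(z) - E(\xi)| \le M_\xi$ for almost all $z \in Z$, then, for all $\varepsilon > 0$,

$$P^m\left\{ \frac{1}{m} \sum_{i=1}^m \xi(z_i) - E(\xi) \ge \varepsilon \right\} \le \exp\left\{ -\frac{m \varepsilon^2}{2\left( \sigma^2 + \frac{1}{3} M_\xi \varepsilon \right)} \right\}.$$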
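The one-sided Bernstein bound $\exp\{-m\varepsilon^2 / (2(\sigma^2 + \frac{1}{3}M_\xi\varepsilon))\}$ is easy to evaluate and to invert for the sample size. The following sketch uses hypothetical function names and arbitrary illustrative numbers; it computes the tail bound and the smallest $m$ for which the bound drops below a target failure probability $\delta$.

```python
import math


def bernstein_tail(m: int, eps: float, sigma2: float, m_xi: float) -> float:
    """Upper bound on P{ empirical mean - E(xi) >= eps } from the one-sided Bernstein inequality."""
    return math.exp(-m * eps ** 2 / (2.0 * (sigma2 + m_xi * eps / 3.0)))


def samples_for_confidence(delta: float, eps: float, sigma2: float, m_xi: float) -> int:
    """Smallest m with bernstein_tail(m, eps, sigma2, m_xi) <= delta."""
    return math.ceil(2.0 * (sigma2 + m_xi * eps / 3.0) * math.log(1.0 / delta) / eps ** 2)


if __name__ == "__main__":
    m = samples_for_confidence(delta=0.01, eps=0.05, sigma2=0.2, m_xi=8.0)
    print(m, bernstein_tail(m, eps=0.05, sigma2=0.2, m_xi=8.0))
```

When the variance $\sigma^2$ is small compared with $M_\xi\varepsilon$, the required sample size scales like $1/\varepsilon$ rather than $1/\varepsilon^2$, which is why the variance bound on $\xi_2$ matters in the estimates that follow.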

With the help of the above lemma, we can deduce the following bound of $\mathcal{S}_2$.

Proposition 5.5. For every $\varepsilon > 0$, with confidence at least

$$1 - \exp\left\{ -\frac{3m\varepsilon^2}{48M^2 \left( 2\|f_n - f_\rho\|_\rho^2 + \varepsilon \right)} \right\},$$

there holds

$$\frac{1}{m} \sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le \varepsilon.$$

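Proof. It follows from Lemma 2.2 that $\|f_n\|_\infty \le M$, which together with $|f_\rho(x)| \le M$ yields that

$$|\xi_2| \le (\|f_n\|_\infty + M)(\|f_n\|_\infty + M) \le 4M^2.$$

Hence $|\xi_2 - E(\xi_2)| \le 8M^2$. Moreover, we have

$$E(\xi_2^2) = E\left( (f_n(X) - f_\rho(X))^2 \times \left( (f_n(X) - Y) + (f_\rho(X) - Y) \right)^2 \right) \le 16M^2 \|f_n - f_\rho\|_\rho^2,$$

which implies that

$$\sigma^2(\xi_2) \le E(\xi_2^2) \le 16M^2 \|f_n - f_\rho\|_\rho^2.$$

Now we apply Lemma 5.4 to $\xi_2$. It asserts that for any $t > 0$,

$$\frac{1}{m} \sum_{i=1}^m \xi_2(z_i) - E(\xi_2) \le t$$

with confidence at least

$$1 - \exp\left\{ -\frac{m t^2}{2\left( \sigma^2(\xi_2) + \frac{8}{3} M^2 t \right)} \right\} \ge 1 - \exp\left\{ -\frac{3 m t^2}{48M^2 \left( 2\|f_n - f_\rho\|_\rho^2 + t \right)} \right\}.$$

This implies the desired estimate. ■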

It is more difficult to estimate $\mathcal{S}_1$ because $\xi_1$ involves the sample $\mathbf{z}$ through $f_{\mathbf{z},\lambda}$. We will use the idea of empirical risk minimization to bound this term by means of the covering number [?]. The main tools are the following three lemmas.

Lemma 5.6. Let $V_k$ be a $k$-dimensional function space defined on $\mathbb{S}^d$. Denote $\pi_M V_k = \{\pi_M f : f \in V_k\}$. Then

$$\log \mathcal{N}(\pi_M V_k, \eta) \le c k \log \frac{M}{\eta},$$

where $c$ is a positive constant and $\mathcal{N}(\pi_M V_k, \eta)$ is the covering number associated with the uniform norm, that is, the number of elements in a least $\eta$-net of $\pi_M V_k$.
Lemma 5.6 is a direct result of combining [19, Property 1] and [20, P. 437]. It shows that the covering number of a bounded function space can also be bounded properly. The following ratio probability inequality is a standard result in learning theory [7]. It deals with variances for a function class, since the Bernstein inequality takes care of the variance well only for a single random variable.

Lemma 5.7. Let $\mathcal{G}$ be a set of functions on $Z$ such that, for some $c \ge 0$, $|g - E(g)| \le B$ almost everywhere and $E(g^2) \le c E(g)$ for each $g \in \mathcal{G}$. Then, for every $\varepsilon > 0$,

$$P^m\left\{ \sup_{g \in \mathcal{G}} \frac{E(g) - \frac{1}{m} \sum_{i=1}^m g(z_i)}{\sqrt{E(g) + \varepsilon}} > \sqrt{\varepsilon} \right\} \le \mathcal{N}(\mathcal{G}, \varepsilon) \exp\left\{ -\frac{m\varepsilon}{2c + \frac{2B}{3}} \right\}.$$
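Now we are in a position to give an upper bound of $\mathcal{S}_1$.

Proposition 5.8. For all $\varepsilon > 0$,

$$\mathcal{S}_1 \le \frac{1}{2} \left( \mathcal{E}(\pi_M f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \right) + \varepsilon$$

holds with confidence at least

$$1 - \exp\left\{ c n^d \log \frac{4M^2}{\varepsilon} - \frac{3m\varepsilon}{128M^2} \right\}.$$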

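Proof. Set

$$\mathcal{F} := \left\{ (f(X) - Y)^2 - (f_\rho(X) - Y)^2 : f \in \pi_M \mathcal{H}_K \right\}.$$

Then for $g \in \mathcal{F}$, there exists $f \in \mathcal{H}_K$ such that $g(Z) = (\pi_M f(X) - Y)^2 - (f_\rho(X) - Y)^2$. Therefore,

$$E(g) = \mathcal{E}(\pi_M f) - \mathcal{E}(f_\rho) \ge 0, \qquad \frac{1}{m} \sum_{i=1}^m g(z_i) = \mathcal{E}_{\mathbf{z}}(\pi_M f) - \mathcal{E}_{\mathbf{z}}(f_\rho).$$

Since $|\pi_M f| \le M$ and $|f_\rho(X)| \le M$ almost everywhere, we find that

$$|g(z)| = \left| (\pi_M f(X) - f_\rho(X)) \left( (\pi_M f(X) - Y) + (f_\rho(X) - Y) \right) \right| \le 8M^2$$

almost everywhere. It follows that $|g(z) - E(g)| \le 16M^2$ almost everywhere and

$$E(g^2) \le 16M^2 \|\pi_M f - f_\rho\|_\rho^2 = 16M^2 E(g).$$

Now we apply Lemma 5.7 with $B = c = 16M^2$ to the set of functions $\mathcal{F}$ and obtain that

$$\sup_{f \in \pi_M \mathcal{H}_K} \frac{\{\mathcal{E}(f) - \mathcal{E}(f_\rho)\} - \{\mathcal{E}_{\mathbf{z}}(f) - \mathcal{E}_{\mathbf{z}}(f_\rho)\}}{\sqrt{\{\mathcal{E}(f) - \mathcal{E}(f_\rho)\} + \varepsilon}} \le \sqrt{\varepsilon} \qquad (5.3)$$

with confidence at least

$$1 - \mathcal{N}(\mathcal{F}, \varepsilon) \exp\left\{ -\frac{3m\varepsilon}{128M^2} \right\}.$$

Observe that for $g_1, g_2 \in \mathcal{F}$ there exist $f_1, f_2 \in \pi_M \mathcal{H}_K$ such that

$$g_j(Z) = (f_j(X) - Y)^2 - (f_\rho(X) - Y)^2, \quad j = 1, 2.$$

In addition, there holds

$$|g_1(Z) - g_2(Z)| = \left| (f_1(X) - Y)^2 - (f_2(X) - Y)^2 \right| \le 4M \|f_1 - f_2\|_\infty.$$

We see that for any $\varepsilon > 0$, an $\left( \frac{\varepsilon}{4M} \right)$-covering of $\pi_M \mathcal{H}_K$ provides an $\varepsilon$-covering of $\mathcal{F}$. Therefore

$$\mathcal{N}(\mathcal{F}, \varepsilon) \le \mathcal{N}\left( \pi_M \mathcal{H}_K, \frac{\varepsilon}{4M} \right).$$

Then the confidence is

$$1 - \mathcal{N}(\mathcal{F}, \varepsilon) \exp\left\{ -\frac{3m\varepsilon}{128M^2} \right\} \ge 1 - \mathcal{N}\left( \pi_M \mathcal{H}_K, \frac{\varepsilon}{4M} \right) \exp\left\{ -\frac{3m\varepsilon}{128M^2} \right\}.$$

Since

$$\sqrt{\varepsilon} \sqrt{\{\mathcal{E}(f) - \mathcal{E}(f_\rho)\} + \varepsilon} \le \frac{1}{2} \{\mathcal{E}(f) - \mathcal{E}(f_\rho)\} + \varepsilon,$$

it follows from (5.3) and Lemma 5.6 that

$$\mathcal{S}_1 \le \frac{1}{2} \left( \mathcal{E}(\pi_M f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \right) + \varepsilon$$

holds with confidence at least

$$1 - \exp\left\{ c n^d \log \frac{4M^2}{\varepsilon} - \frac{3m\varepsilon}{128M^2} \right\}.$$

This finishes the proof. ■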
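The elementary inequality used in the last step of the proof is an instance of the arithmetic-geometric mean inequality. Writing $a := \mathcal{E}(f) - \mathcal{E}(f_\rho) \ge 0$ (a shorthand introduced here only for this remark), the step can be spelled out as:

```latex
\sqrt{\varepsilon}\,\sqrt{a+\varepsilon}
  \;\le\; \frac{\varepsilon + (a+\varepsilon)}{2}
  \;=\; \frac{a}{2} + \varepsilon ,
\qquad\text{so that}\qquad
a - \Bigl(\mathcal{E}_{\mathbf{z}}(f) - \mathcal{E}_{\mathbf{z}}(f_\rho)\Bigr)
  \;\le\; \sqrt{\varepsilon}\,\sqrt{a+\varepsilon}
  \;\le\; \frac{a}{2} + \varepsilon .
```

Applied with $f = \pi_M f_{\mathbf{z},\lambda}$, the left-hand side is exactly $\mathcal{S}_1$, which gives the bound stated in the proposition.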

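Now we are in a position to deduce the final learning rate of the kernel ridge regression (3.2). Firstly, it follows from Propositions 5.3, 5.5 and 5.8 that

$$\mathcal{E}(\pi_M f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \le \mathcal{D}_n(\lambda) + \mathcal{S}_1 + \mathcal{S}_2 \le C(n^{-2r} + \lambda M^2) + \frac{1}{2} \left( \mathcal{E}(\pi_M f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \right) + 2\varepsilon$$

holds with confidence at least

$$1 - \exp\left\{ c n^d \log \frac{4M^2}{\varepsilon} - \frac{3m\varepsilon}{128M^2} \right\} - \exp\left\{ -\frac{3m\varepsilon^2}{48M^2 \left( 2\|f_n - f_\rho\|_\rho^2 + \varepsilon \right)} \right\}.$$

Then, by setting $\varepsilon \ge \varepsilon_+ \ge C(m/\log m)^{-2r/(2r+d)}$, $n = [\varepsilon^{-1/(2r)}]$ and $\lambda \le M^{-2}\varepsilon$, we get that, with confidence at least

$$1 - \exp\{-Cm\varepsilon\},$$

there holds

$$\mathcal{E}(\pi_M f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \le 4\varepsilon.$$

The lower bound can be deduced more easily. Actually, it can be deduced from Chapter 3 of [9] that, for any estimator $f_{\mathbf{z}} \in \Phi_m$, there holds

$$\sup_{f_\rho \in W_r} P^m\{\mathbf{z} : \|f_{\mathbf{z}} - f_\rho\|_\rho^2 \ge \varepsilon\} \ge \begin{cases} \varepsilon_0, & \varepsilon < \varepsilon_-, \\ e^{-cm\varepsilon}, & \varepsilon \ge \varepsilon_-, \end{cases}$$

where $\varepsilon_0$ is a positive constant and $\varepsilon_- = c m^{-2r/(2r+d)}$ for some universal constant $c$. With this, the proof of Theorem 3.1 is completed.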


5.2. Proof of Theorem 4.1

Before we proceed to the proof, we first present a simple description of the methodology. The methodology adopted in the proof of Theorem 4.1 appears to be novel. Traditionally, the generalization error of learning schemes in the sample dependent hypothesis space (SDHS) is divided into the approximation, hypothesis and sample errors (three terms) [?]. All of the aforementioned results about coefficient regularization in SDHS fall into this style. According to [?], the hypothesis error has been regarded as a reflection of the data dependent nature of SDHS, and an indispensable part attributed to an essential characteristic of learning algorithms in SDHS, compared with learning schemes in the sample independent hypothesis space (SIHS). With the needlet kernel $K_n$, we will divide the generalization error of $l^q$ kernel regularization into the approximation and sample errors (two terms) only. The core tool is the needlet kernel's excellent localization properties in both the spatial and frequency domains, with which the reproducing property, the compressible property and the best approximation property can be guaranteed. After presenting a probabilistic cubature formula for spherical polynomials, we can prove that all the polynomials can be represented via the SDHS. This helps us to deduce the approximation error. Since $\mathcal{H}_{K,\mathbf{z}} \subseteq \mathcal{H}_K$, the bound of the sample error is the same as that in the previous subsection. Thus, we divide the proof into three parts. The first one is devoted to establishing the probabilistic cubature formula. The second one is to construct the random approximant and study the approximation error. The third one is to deduce the sample error and derive the final learning rate.


To present the probabilistic cubature formula, we need the following two lemmas. The first one is the Nikolskii inequality for spherical polynomials [22].

Lemma 5.9. Let $1 \le p < q \le \infty$, and let $n \ge 1$ be an integer. Then

$$\|Q\|_{L^q(\mathbf{S}^d)} \le C n^{\frac{d}{p} - \frac{d}{q}} \|Q\|_{L^p(\mathbf{S}^d)}, \qquad Q \in \Pi_n^d,$$

where the constant $C$ depends only on $d$.
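
As a quick numerical sanity check of this type of inequality, the sketch below works on the circle $\mathbf{S}^1$ (so $d = 1$) with $p = 2$ and $q = \infty$, where the bound takes the explicit form $\|Q\|_\infty \le \sqrt{2n+1}\,\|Q\|_2$ for the normalized measure $d\theta/(2\pi)$. The restriction to the circle, the orthonormal trigonometric basis and the helper names are assumptions made for illustration only.

```python
import numpy as np

def trig_poly(coef, theta):
    """Evaluate Q(theta) = c_0 + sum_k sqrt(2) (a_k cos(k theta) + b_k sin(k theta)).

    coef = [c_0, a_1, b_1, ..., a_n, b_n] are coefficients in a basis that is
    orthonormal for the normalized measure d theta / (2 pi) on the circle.
    """
    coef = np.asarray(coef, dtype=float)
    n = (len(coef) - 1) // 2
    theta = np.asarray(theta, dtype=float)
    values = np.full(theta.shape, coef[0])
    for k in range(1, n + 1):
        values = values + np.sqrt(2.0) * (coef[2 * k - 1] * np.cos(k * theta)
                                          + coef[2 * k] * np.sin(k * theta))
    return values

def nikolskii_ratio(coef, grid_size=4096):
    """Return ||Q||_inf / (sqrt(2n+1) ||Q||_2), with the sup taken over a grid."""
    n = (len(coef) - 1) // 2
    theta = np.linspace(0.0, 2.0 * np.pi, grid_size, endpoint=False)
    sup_norm = np.max(np.abs(trig_poly(coef, theta)))
    l2_norm = np.linalg.norm(coef)           # Parseval: the basis is orthonormal
    return sup_norm / (np.sqrt(2 * n + 1) * l2_norm)

rng = np.random.default_rng(0)
ratios = [nikolskii_ratio(rng.standard_normal(2 * n + 1)) for n in (1, 4, 16, 64)]
print(ratios)   # each ratio is bounded by 1 (Cauchy-Schwarz)
```

The growth factor $\sqrt{2n+1}$ corresponds to $n^{d/p - d/q}$ with $d = 1$, $p = 2$, $q = \infty$, up to a constant.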

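To state the next lemma, we need to introduce the following definitions. Let $V$ be a finite dimensional vector space with norm $\|\cdot\|_V$, and let $\mathcal{U} \subset V^*$ be a finite set. Here $V^*$ denotes the dual space of $V$. We say that $\mathcal{U}$ is a norm generating set for $V$ if the mapping $T_{\mathcal{U}} : V \to \mathbf{R}^{\mathrm{Card}(\mathcal{U})}$ defined by $T_{\mathcal{U}}(x) = (u(x))_{u \in \mathcal{U}}$ is injective, where $\mathrm{Card}(\mathcal{U})$ is the cardinality of the set $\mathcal{U}$ and $T_{\mathcal{U}}$ is called the sampling operator. Let $\mathcal{W} := T_{\mathcal{U}}(V)$ be the range of $T_{\mathcal{U}}$; then the injectivity of $T_{\mathcal{U}}$ implies that $T_{\mathcal{U}}^{-1} : \mathcal{W} \to V$ exists. Let $\mathbf{R}^{\mathrm{Card}(\mathcal{U})}$ have a norm $\|\cdot\|_{\mathbf{R}^{\mathrm{Card}(\mathcal{U})}}$, with $\|\cdot\|_{\mathbf{R}^{\mathrm{Card}(\mathcal{U})*}}$ being its dual norm on $\mathbf{R}^{\mathrm{Card}(\mathcal{U})*}$. Equip $\mathcal{W}$ with the induced norm, and let $\|T_{\mathcal{U}}^{-1}\| := \|T_{\mathcal{U}}^{-1}\|_{\mathcal{W} \to V}$. In addition, let $\mathcal{K}_+$ be the positive cone of $\mathbf{R}^{\mathrm{Card}(\mathcal{U})}$, i.e., all $(r_u) \in \mathbf{R}^{\mathrm{Card}(\mathcal{U})}$ for which $r_u \ge 0$. Then the following Lemma 5.10 can be found in [23].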


Lemma 5.10. Let $\mathcal{U}$ be a norm generating set for $V$, with $T_{\mathcal{U}}$ being the corresponding sampling operator. If $v \in V^*$ with $\|v\|_{V^*} \le A$, then there exist real numbers $\{a_u\}_{u \in \mathcal{U}}$, depending only on $v$, such that for every $t \in V$,

$$v(t) = \sum_{u \in \mathcal{U}} a_u u(t),$$

and

$$\|(a_u)\|_{\mathbf{R}^{\mathrm{Card}(\mathcal{U})*}} \le A \|T_{\mathcal{U}}^{-1}\|.$$

Also, if $\mathcal{W}$ contains an interior point $v_0 \in \mathcal{K}_+$ and if $v(T_{\mathcal{U}}^{-1} t) \ge 0$ when $t \in \mathcal{W} \cap \mathcal{K}_+$, then we may choose $a_u \ge 0$.

With the help of Lemma 5.4, Lemma 5.9 and Lemma 5.10, we can deduce the following probabilistic cubature formula.

Proposition 5.11. Let $N$ be a positive integer and $1 \le p \le 2$. If $\Lambda_N := \{t_i\}_{i=1}^N$ are i.i.d. random variables drawn according to an arbitrary distribution $\mu$ on $\mathbf{S}^d$, then there exists a set of real numbers $\{a_i\}_{i=1}^N$ such that, for every $Q_n \in \Pi_n^d$,

$$\int_{\mathbf{S}^d} Q_n(x)\, d\omega(x) = \sum_{i=1}^{N} a_i Q_n(t_i)$$

holds with confidence at least

$$1 - 2\exp\left\{ -\frac{CN}{D_\mu n^d} + C n^d \right\},$$

subject to


$$\sum_{i=1}^{N} |a_i|^p \le C N^{1-p}.$$


Proof. Without loss of generality, we assume $Q_n \in \mathcal{P}^0 := \{f \in \Pi_n^d : \|f\|_p \le 1\}$. We denote the $\delta$-net of all $f \in \mathcal{P}^0$ by $\mathcal{A}(\delta)$. It follows from [?, Chap. 9] and the definition of the covering number that the smallest cardinality of $\mathcal{A}(\delta)$ is bounded by

$$\exp\{C n^d \log (1/\delta)\}.$$

Given $Q_n \in \mathcal{P}^0$, let $P_j$ be the polynomial in $\mathcal{A}(2^{-j})$ which is closest to $Q_n$ in the uniform norm, with some convention for breaking ties. Since $\|Q_n - P_j\| \to 0$, with the notation $\eta_i(P) = |P(t_i)|^p - \|P\|_p^p$, we can write

$$\eta_i(Q_n) = \eta_i(P_0) + \sum_{l=0}^{\infty} \left(\eta_i(P_{l+1}) - \eta_i(P_l)\right).$$

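Since the sampling set $\Lambda_N$ consists of a sequence of i.i.d. random variables on $\mathbf{S}^d$, the sampling points are a sequence of functions $t_j = t_j(\omega)$ on some probability space $(\Omega, \mathbf{P})$. If we set $\xi_j^p(P) = |P(t_j)|^p$, then

$$\eta_j(P) = \xi_j^p(P) - \mathbf{E}\xi_j^p(P),$$

where we have used the equalities

$$\mathbf{E}\xi_j^p = \int_{\mathbf{S}^d} |P(x)|^p\, d\rho_X = \|P\|_p^p.$$

Furthermore,

$$|\eta_i(P)| \le \sup_{\omega \in \Omega} \left| |P(t_i(\omega))|^p - \|P\|_p^p \right| \le \|P\|_\infty^p + \|P\|_p^p.$$

It follows from Lemma 5.9 that

$$\|P\|_\infty \le C n^{d/p} \|P\|_p.$$

Hence

$$|\eta_i(P) - \mathbf{E}\eta_i(P)| \le C D_{\rho_X} n^d.$$

Moreover, using Lemma 5.9 again, there holds

$$\sigma^2(\eta_i(P)) \le \mathbf{E}\left((\eta_i(P))^2\right) \le \|P\|_\infty^p \|P\|_p^p - \|P\|_p^{2p} \le C D_{\rho_X} n^d.$$

Then, using Lemma 5.4 with $\varepsilon = 1/2$ and $M_\xi = C D_{\rho_X} n^d$, we have for fixed $P \in \mathcal{A}(1)$, with probability at most $2\exp\{-CN/(D_{\rho_X} n^d)\}$, there holds

$$\left| \frac{1}{N}\sum_{i=1}^{N} \eta_i(P) \right| \ge \frac{1}{4}.$$

Noting that there are at most $\exp\{Cn^d\}$ polynomials in $\mathcal{A}(1)$, we get

$$\mathbf{P}^N\left\{ \left| \frac{1}{N}\sum_{i=1}^{N} \eta_i(P) \right| \ge \frac{1}{4} \ \text{for some } P \in \mathcal{A}(1) \right\} \le 2\exp\left\{ -\frac{CN}{D_{\rho_X} n^d} + C n^d \right\}. \qquad (5.4)$$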


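Now, we aim to bound the probability of the event:

(e1) for some $l \ge 1$, some $P \in \mathcal{A}(2^{-l})$ and some $Q \in \mathcal{A}(2^{-l+1})$ with $\|P - Q\| \le 3 \times 2^{-l}$, there holds

$$\left| \frac{1}{N}\sum_{i=1}^{N} \left(\eta_i(P) - \eta_i(Q)\right) \right| \ge \frac{1}{4(l+1)^2}.$$

The main tool is again the Bernstein inequality. To this end, we should bound $|\eta_i(P) - \eta_i(Q) - \mathbf{E}(\eta_i(P) - \eta_i(Q))|$ and the variance $\sigma^2(\eta_i(P) - \eta_i(Q))$. According to the Taylor formula

$$a^2 = b^2 + (a + b)(a - b),$$

and Lemma 5.9, we have

$$|\eta_i(P) - \eta_i(Q)| \le \sup_{\omega \in \Omega} \left| |P(t_i(\omega))|^p - |Q(t_i(\omega))|^p \right| + \left| \|Q\|_p^p - \|P\|_p^p \right| \le C D_{\rho_X} n^d \|P - Q\|$$

and

$$\sigma^2(\eta_i(P) - \eta_i(Q)) \le \mathbf{E}\left((\eta_i(P) - \eta_i(Q))^2\right) = \int_{\mathbf{S}^d} \left(|P(x)|^p - |Q(x)|^p\right)^2 d\rho_X - \left(\|P\|_p^p - \|Q\|_p^p\right)^2 \le C D_{\rho_X} n^d \|P - Q\|^2.$$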

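If $P \in \mathcal{A}(2^{-l})$ and $Q \in \mathcal{A}(2^{-l+1})$ with $\|P - Q\| \le 3 \times 2^{-l}$, then it follows from Lemma 5.4 again that

$$\mathbf{P}^N\left\{ \left| \frac{1}{N}\sum_{i=1}^{N} \left(\eta_i(P) - \eta_i(Q)\right) \right| \ge \frac{1}{4(l+1)^2} \right\} \le 2\exp\left\{ -\frac{N}{C D_{\rho_X} n^d \left(2^{-2l}(l+1)^4 + 2^{-l}(l+1)^2\right)} \right\}.$$

Since there are at most $2\exp\{Cn^d \log 2^l\}$ polynomials in $\mathcal{A}(2^{-l}) \cup \mathcal{A}(2^{-l+1})$, the event (e1) holds with probability at most

$$\sum_{l=1}^{\infty} 2\exp\left\{ -\frac{N}{C D_{\rho_X} n^d \left(2^{-2l}(l+1)^4 + 2^{-l}(l+1)^2\right)} + C n^d \log 2^l \right\} \le \sum_{l=1}^{\infty} 2\exp\left\{ -2^{l/2}\left( \frac{CN}{D_{\rho_X} n^d} - C n^d \right) \right\}.$$

Since the series $\sum_{l \ge 1} e^{-a b^l}$ is bounded by a constant multiple of its first term for any $a > 1$ and $b > 1$, we then deduce that

$$\mathbf{P}^N\{\text{the event (e1) holds}\} \le 2\exp\left\{ -\frac{CN}{D_{\rho_X} n^d} + C n^d \right\}. \qquad (5.5)$$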


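Thus, it follows from (5.4) and (5.5) that with confidence at least

$$1 - 2\exp\left\{ -\frac{CN}{D_{\rho_X} n^d} + C n^d \right\}$$

there holds

$$\left| \frac{1}{N}\sum_{i=1}^{N} \eta_i(Q_n) \right| \le \left| \frac{1}{N}\sum_{i=1}^{N} \eta_i(P_0) \right| + \sum_{l=1}^{\infty} \left| \frac{1}{N}\sum_{i=1}^{N} \left(\eta_i(P_l) - \eta_i(P_{l-1})\right) \right| \le \frac{1}{4} + \sum_{l=1}^{\infty} \frac{1}{4(l+1)^2} = \frac{1}{4}\cdot\frac{\pi^2}{6} < \frac{1}{2}.$$

This means that with confidence at least

$$1 - 2\exp\left\{ -\frac{CN}{D_{\rho_X} n^d} + C n^d \right\}$$

there holds

$$\frac{1}{2}\|Q_n\|_p^p \le \frac{1}{N}\sum_{i=1}^{N} |Q_n(t_i)|^p \le \frac{3}{2}\|Q_n\|_p^p, \qquad \forall\, Q_n \in \Pi_n^d. \qquad (5.6)$$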

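Now, we use (5.6) and Lemma 5.10 to prove Proposition 5.11. In Lemma 5.10 we take $V = \Pi_n^d$, $\|Q_n\|_V = \|Q_n\|_p$, and $\mathcal{U}$ to be the set of point evaluation functionals $\{\delta_{t_i}\}_{i=1}^N$. The operator $T_{\mathcal{U}}$ is then the restriction map $Q_n \mapsto Q_n|_{\Lambda_N}$, with

$$\|f\|_{\Lambda_N, p} := \left( \frac{1}{N}\sum_{i=1}^{N} |f(t_i)|^p \right)^{1/p}.$$

It follows from (5.6) that with confidence at least

$$1 - 2\exp\left\{ -\frac{CN}{D_{\rho_X} n^d} + C n^d \right\}$$

there holds $\|T_{\mathcal{U}}^{-1}\| \le 2$. We now take $v$ to be the functional

$$v : Q_n \mapsto \int_{\mathbf{S}^d} Q_n(x)\, d\omega(x).$$

By the Hölder inequality, $\|v\|_{V^*} \le 1$. Therefore, Lemma 5.10 shows that

$$\int_{\mathbf{S}^d} Q_n(x)\, d\omega(x) = \sum_{i=1}^{N} a_i Q_n(t_i)$$

holds with confidence at least

$$1 - 2\exp\left\{ -\frac{CN}{D_{\rho_X} n^d} + C n^d \right\}$$

subject to

$$\|(a_i)\|_{\mathbf{R}^{N*}} \le \|v\|_{V^*}\, \|T_{\mathcal{U}}^{-1}\| \le 2.$$

Then, the Hölder inequality finishes the proof of Proposition 5.11. ■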
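
The existence of such cubature weights can be explored numerically. The sketch below again uses the circle $\mathbf{S}^1$ with the normalized measure and uniformly distributed random nodes, both of which are simplifying assumptions. It computes minimum-norm weights by least squares; this is one convenient way to obtain weights that are exact on trigonometric polynomials of degree at most $n$, and it is not the construction used in the proof, which goes through Lemma 5.10. All function names are hypothetical.

```python
import numpy as np

def basis_matrix(theta, n):
    """Rows 1, cos(k t), sin(k t) for k = 1..n, evaluated at the nodes theta."""
    rows = [np.ones_like(theta)]
    for k in range(1, n + 1):
        rows.append(np.cos(k * theta))
        rows.append(np.sin(k * theta))
    return np.vstack(rows)                       # shape (2n+1, N)

def min_norm_cubature_weights(theta, n):
    """Weights a with sum_i a_i Q(theta_i) = mean of Q over the circle for
    every trigonometric polynomial Q of degree <= n (requires N >= 2n+1)."""
    B = basis_matrix(theta, n)
    moments = np.zeros(2 * n + 1)
    moments[0] = 1.0                             # only the constant has nonzero mean
    # for an underdetermined system lstsq returns the minimum-norm solution
    weights, *_ = np.linalg.lstsq(B, moments, rcond=None)
    return weights

rng = np.random.default_rng(1)
N, n = 2000, 5
nodes = rng.uniform(0.0, 2.0 * np.pi, size=N)    # i.i.d. random nodes
a = min_norm_cubature_weights(nodes, n)
exactness_error = np.max(np.abs(basis_matrix(nodes, n) @ a - np.eye(2 * n + 1)[0]))
print(exactness_error, N * np.sum(a ** 2))
```

In this experiment the quantity $N\sum_i a_i^2$ stays of order one, so the individual weights are of size about $1/N$.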
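To estimate the upper bound of

$$\mathcal{E}(\pi_M f_{\mathbf{z},\lambda,q}) - \mathcal{E}(f_\rho),$$

we first introduce an error decomposition strategy. It follows from the definition of $f_{\mathbf{z},\lambda,q}$ that, for arbitrary $f \in \mathcal{H}_{K,\mathbf{z}}$,

$$\begin{aligned}
\mathcal{E}(\pi_M f_{\mathbf{z},\lambda,q}) - \mathcal{E}(f_\rho) &\le \mathcal{E}(\pi_M f_{\mathbf{z},\lambda,q}) - \mathcal{E}(f_\rho) + \lambda\Omega_{\mathbf{z}}^q(f_{\mathbf{z},\lambda,q}) \\
&\le \mathcal{E}(\pi_M f_{\mathbf{z},\lambda,q}) - \mathcal{E}_{\mathbf{z}}(\pi_M f_{\mathbf{z},\lambda,q}) + \mathcal{E}_{\mathbf{z}}(f) - \mathcal{E}(f) \\
&\quad + \mathcal{E}_{\mathbf{z}}(\pi_M f_{\mathbf{z},\lambda,q}) + \lambda\Omega_{\mathbf{z}}^q(f_{\mathbf{z},\lambda,q}) - \mathcal{E}_{\mathbf{z}}(f) - \lambda\Omega_{\mathbf{z}}^q(f) \\
&\quad + \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda\Omega_{\mathbf{z}}^q(f) \\
&\le \mathcal{E}(\pi_M f_{\mathbf{z},\lambda,q}) - \mathcal{E}_{\mathbf{z}}(\pi_M f_{\mathbf{z},\lambda,q}) + \mathcal{E}_{\mathbf{z}}(f) - \mathcal{E}(f) \\
&\quad + \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda\Omega_{\mathbf{z}}^q(f).
\end{aligned}$$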


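Since $f_\rho \in W_r$ with $r > \frac{d}{2}$, it follows from the Sobolev embedding theorem and the Jackson inequality [?] that there exists a $P_\rho \in \Pi_n^d$ such that

$$\|P_\rho\| \le c\|f_\rho\| \quad \text{and} \quad \|f_\rho - P_\rho\|^2 \le C n^{-2r}. \qquad (5.7)$$

Then we have

$$\begin{aligned}
\mathcal{E}(f_{\mathbf{z},\lambda,q}) - \mathcal{E}(f_\rho) &\le \left\{ \mathcal{E}(P_\rho) - \mathcal{E}(f_\rho) + \lambda\Omega_{\mathbf{z}}^q(P_\rho) \right\} \\
&\quad + \left\{ \mathcal{E}(f_{\mathbf{z},\lambda,q}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\lambda,q}) + \mathcal{E}_{\mathbf{z}}(P_\rho) - \mathcal{E}(P_\rho) \right\} \\
&=: \mathcal{D}(\mathbf{z},\lambda,q) + \mathcal{S}(\mathbf{z},\lambda,q),
\end{aligned}$$

where $\mathcal{D}(\mathbf{z},\lambda,q)$ and $\mathcal{S}(\mathbf{z},\lambda,q)$ are called the approximation error and the sample error, respectively. The following Proposition 5.12 presents an upper bound for the approximation error.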

Proposition 5.12. Let $m, n \in \mathbf{N}$, $r > d/2$ and $f_\rho \in W_r$. Then, with confidence at least $1 - 2\exp\{-cm/(D_{\rho_X} n^d)\}$, there holds

$$\mathcal{D}(\mathbf{z},\lambda,q) \le C\left(n^{-2r} + 2\lambda m^{1-q}\right),$$

where $C$ and $c$ are constants depending only on $d$ and $r$.

Proof. From Lemma 2.2 it is easy to deduce that

$$P_\rho(x) = \int_{\mathbf{S}^d} P_\rho(x') K_n(x, x')\, d\omega(x').$$

Thus, Proposition 5.11, the Hölder inequality and $r > d/2$ yield that with confidence at least $1 - 2\exp\{-cm/n^d\}$, there exists a set of real numbers $\{a_i\}_{i=1}^m$ satisfying $\sum_{i=1}^m |a_i|^q \le C m^{1-q}$ for $q > 0$ such that

$$P_\rho(x) = \sum_{i=1}^{m} a_i P_\rho(x_i) K_n(x_i, x).$$

The above observation together with (5.7) implies that with confidence at least $1 - 2\exp\{-cm/(D_{\rho_X} n^d)\}$, $P_\rho$ can be represented as

$$P_\rho(x) = \sum_{i=1}^{m} a_i P_\rho(x_i) K_n(x_i, x) \in \mathcal{H}_{K,\mathbf{z}}$$

such that for arbitrary $f_\rho \in W_r$, there holds

$$\|P_\rho - f_\rho\|^2 \le C n^{-2r}$$

and

$$\Omega_{\mathbf{z}}^q(P_\rho) \le \sum_{i=1}^{m} |a_i P_\rho(x_i)|^q \le 2C m^{1-q},$$

where $C$ is a constant depending only on $d$ and $M$. It thus implies that the inequalities

$$\mathcal{D}(\mathbf{z},\lambda,q) \le \|P_\rho - f_\rho\|_\rho^2 + \lambda\Omega_{\mathbf{z}}^q(P_\rho) \le C\left(n^{-2r} + 2\lambda m^{1-q}\right) \qquad (5.8)$$

hold with confidence at least $1 - 2\exp\{-cm/(D_{\rho_X} n^d)\}$. ■

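At last, we deduce the final learning rate of the $l^q$ kernel regularization schemes (4.1). Firstly, it follows from Propositions 5.12, 5.8 and 5.5 that

$$\mathcal{E}(f_{\mathbf{z},\lambda,q}) - \mathcal{E}(f_\rho) \le \mathcal{D}(\mathbf{z},\lambda,q) + \mathcal{S}_1 + \mathcal{S}_2 \le C\left(n^{-2r} + \lambda m^{1-q}\right) + \frac{1}{2}\left(\mathcal{E}(f_{\mathbf{z},\lambda,q}) - \mathcal{E}(f_\rho)\right) + 2\varepsilon$$

holds with confidence at least

$$1 - 4\exp\left\{-\frac{cm}{D_{\rho_X} n^d}\right\} - \exp\left\{ c n^{d} \log\frac{4M^2}{\varepsilon} - \frac{3m\varepsilon}{128M^2} \right\} - \exp\left\{ -\frac{3m\varepsilon^2}{48M^2\left(2n^{-2r} + \varepsilon\right)} \right\}.$$

Then, by setting $\varepsilon \ge \varepsilon_m^+ \ge C(m/\log m)^{-2r/(2r+d)}$, $n = [\varepsilon^{-1/(2r)}]$ and $\lambda \le m^{q-1}\varepsilon$, it follows from $r > d/2$ that the above confidence is at least

$$\begin{aligned}
&1 - 5\exp\left\{-C D_{\rho_X}^{-1} m \varepsilon^{d/(2r)}\right\} - \exp\{-Cm\varepsilon\} - \exp\left\{ C\varepsilon^{-d/(2r)}\left(\log(1/\varepsilon) + \log m\right) - Cm\varepsilon \right\} \\
&\qquad \ge 1 - 6\exp\left\{-C D_{\rho_X}^{-1} m \varepsilon\right\}.
\end{aligned}$$

That is, for $\varepsilon \ge \varepsilon_m^+$,

$$\mathcal{E}(f_{\mathbf{z},\lambda,q}) - \mathcal{E}(f_\rho) \le 6\varepsilon$$

holds with confidence at least $1 - 6\exp\{-C D_{\rho_X}^{-1} m \varepsilon\}$. The same method as in [9, p. 37] and the fact that the uniform distribution satisfies $D_{\rho_X} < \infty$ yield the lower bound of (4.4). This finishes the proof of Theorem 4.1.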
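
For readers who want to experiment with the $q = 1$ case (the kernel lasso), the following is a minimal sketch of a proximal-gradient (ISTA) solver. It assumes the scheme has the usual coefficient-regularized form $\min_a \frac{1}{m}\sum_i (\sum_j a_j K(x_j, x_i) - y_i)^2 + \lambda\sum_j |a_j|$. The paper does not prescribe a solver, so ISTA is swapped in here as a standard choice, and the Gram matrix in the demo is a hypothetical stand-in rather than a needlet kernel matrix.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal map of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(G, y, a, lam):
    """(1/m) ||G a - y||^2 + lam * ||a||_1."""
    return np.mean((G @ a - y) ** 2) + lam * np.sum(np.abs(a))

def kernel_lasso_ista(G, y, lam, n_iter=300):
    """Proximal-gradient iterations for the q = 1 coefficient regularization."""
    m = len(y)
    # the gradient of the quadratic part is (2/m) G^T (G a - y); its Lipschitz
    # constant is (2/m) ||G||_2^2, and the step size is the reciprocal
    step = m / (2.0 * np.linalg.norm(G, 2) ** 2)
    a = np.zeros(G.shape[1])
    values = [objective(G, y, a, lam)]
    for _ in range(n_iter):
        grad = 2.0 * G.T @ (G @ a - y) / m
        a = soft_threshold(a - step * grad, step * lam)
        values.append(objective(G, y, a, lam))
    return a, np.array(values)

# Hypothetical stand-in for a kernel Gram matrix: Phi Phi^T is symmetric PSD.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((60, 8))
G = Phi @ Phi.T / 8.0
a_true = np.zeros(60)
a_true[[3, 17, 42]] = [1.0, -2.0, 1.5]
y = G @ a_true + 0.01 * rng.standard_normal(60)
a_hat, values = kernel_lasso_ista(G, y, lam=0.05)
print(values[0], values[-1], np.count_nonzero(a_hat))
```

With the step size chosen as the reciprocal of the Lipschitz constant, the objective values are non-increasing along the iterations, which gives a simple way to monitor the solver.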


6. Conclusion and discussion

Since their inception in [29], needlets have become the most popular tools for tackling spherical data, owing to their excellent localization in both the frequency and spatial domains. The main novelty of the present paper is to suggest the use of the needlet kernel in kernel methods for spherical data. Our contributions can be summarized as follows. Firstly, the model selection problem of kernel ridge regression boils down to choosing a suitable kernel and the corresponding regularization parameter; that is, there are two types of parameters in total. Tuning both requires a relatively large amount of computation for large-scale data sets. Owing to the needlet kernel's excellent localization in the frequency domain, we prove that, if a truncation operator is applied to the final estimate, then, as far as model selection is concerned, the regularization parameter is not necessary in the sense of rate optimality. This means that only one discrete parameter, the frequency of the needlet kernel, needs tuning in the learning process, which provides theoretical guidance for reducing the computational burden. Secondly, compared with kernel ridge regression, $l^q$ kernel regularization learning, including the kernel lasso estimate and the kernel bridge estimate, may endow the estimator with additional attributes, such as sparsity. When $l^q$ kernel regularization learning is used, the main question is whether it degrades the generalization capability of kernel ridge regression. Owing to the needlet kernel's excellent localization in the spatial domain, we have proved in this paper that, on the premise of embodying the features of $l^q$ $(0 < q < 2)$ kernel regularization learning, the choice of $q$ does not affect the generalization error in the sense of rate optimality. Both results show that the needlet kernel is a good choice for kernel methods dealing with spherical data.

We conclude this paper with the following important remark.


Remark 6.1. There are two types of polynomial kernels for spherical data learning: localized kernels and non-localized kernels. For the non-localized kernels, three papers have focused on their applications in nonparametric regression. The work [?] is the first to derive the learning rate of KRR associated with the polynomial kernel $(1 + x \cdot x')^n$. However, its learning rate was built upon the assumption that $f_\rho$ is a polynomial. [?] removed this assumption by using an eigenvalue estimate of the polynomial kernel, but the learning rate derived in [?] is not optimal. [?] conducted a learning rate analysis for KRR associated with the reproducing kernel of the space $(\Pi_n^d, L^2(\mathbf{S}^d))$ and derived a learning rate similar to that of [?]. In a nutshell, for spherical data learning, to the best of our knowledge, there was no almost optimal minimax learning rate analysis for KRR associated with non-localized kernels. Using the methods of the present paper, especially the technique for bounding the sample error, we can improve the results in [?] and [?] to almost optimal minimax learning rates. For the localized kernels, such as the kernels proposed in [?], we can derive results similar to those for the needlet kernel in this paper. That is, the almost optimal learning rates of KRR and $l^q$ KRS can be derived for these kernels by the same method. Owing to the popularity of needlets in statistics and real-world applications, we only present the learning rate analysis for the needlet kernel. Finally, it should be pointed out that when $y_i = f_\rho(x_i)$, the learning rate of least squares (KRR with $\lambda = 0$) associated with a localized kernel was derived in [?]. The most important difference between our paper and [?] is that we are faced with a nonparametric regression problem, while [?] focused on approximation problems.